This document explores a dataset of white wine attributes along with a quality score. The inputs include objective tests such as PH values and alcohol % by volume. The output is based on sensory data, which is the median of at least 3 evaluations made by wine experts. The scale used by each expert who graded the wine quality was between 0 (very bad) and 10 (very excellent).
The dataset is related to white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
For more information, see the citations at the end of this document.
library(ggplot2)
library(tidyr) # for the Univariate Matrix Plots (gather command specifically)
library(gridExtra) # for multiple plots in one
library(grid) # for adding title/text to gridExtra plots (textGrob specifically)
library(reshape2) # for melting and the heatmap/correlation plot
library(tidyverse)
library(ggExtra) # for ggMarginal
# Load the Data
ww <- read.csv('wineQualityWhites.csv')
In this section I am looking to get a basic understanding of the data I’ll be working with.
# Dimensions of the dataset
dim(ww)
## [1] 4898 13
The data consists of 13 variables with just under 4900 observations.
# Top few rows of data
head(ww)
# Structure of the dataset
str(ww)
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
# Summary of each variable
summary(ww)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
A cursory look at the summary of the variables tells me there are a number of variables with outliers.
# Checking for null values
colSums(is.na(ww))
## X fixed.acidity volatile.acidity
## 0 0 0
## citric.acid residual.sugar chlorides
## 0 0 0
## free.sulfur.dioxide total.sulfur.dioxide density
## 0 0 0
## pH sulphates alcohol
## 0 0 0
## quality
## 0
There are no null values in the dataset.
I am adding two new variables to the dataset. The first one is to duplicate the quality column and change the data type to an ordered factor data type; this is called ‘score’. The second one is to create a new variable called ‘rating’.
Rating Explained: The wine experts gave a quality score for each wine on a scale from 0 (very bad) to 10 (very excellent). However, the values in the quality column range from 3 through 9, meaning there are only 7 possible values. So, the rating column will also be an ordered factor data type, but only contain values 3 through 9.
ggplot(ww, aes(x = quality)) +
geom_bar(stat = "count", fill = 'dodgerblue4') +
scale_y_continuous(limits = c(0, 2500)) +
geom_text(stat='count', aes(label=..count..), vjust=-1) +
scale_x_continuous(breaks = seq(3, 9, 1)) +
labs(title = 'Quality Counts') +
theme(
plot.title = element_text(size=10, face="italic"))
When looking at the distribution for quality, I didn’t think it made sense to combine scores of 5, 6, and 7 together and call them average because that would encompass almost 93% of the data, and that much can’t all be average. Therefore, I decided to use the following scale:
This makes a score of 6 the average and allows for more diversity later when faceting. It also makes logical sense that it is likely more difficult to achieve a high quality wine, so the count above the average is lower than the count below.
First, I add the columns and set the order.
# Create a new variabile for quality and change the datatype
ww$score = ww$quality
ww$score = factor(ww$score, ordered = TRUE)
# Create a new variable for rating
ww$rating <- ifelse(ww$score < 5, 'very bad',
ifelse(ww$score < 6, 'bad',
ifelse(ww$score < 7, 'average',
ifelse(ww$score < 8, 'good', 'very good'))))
ww$rating <- ordered(ww$rating, levels = c('very bad', 'bad',
'average', 'good', 'very good'))
Then, check the results after creating the columns.
table(ww$rating)
##
## very bad bad average good very good
## 183 1457 2198 880 180
ggplot(ww, aes(x = rating)) +
geom_bar(aes(fill = score), stat = "count") +
scale_fill_manual(
values=c("red4", "red", "orange", "springgreen3",
"dodgerblue", "blue", "magenta")) +
geom_text(stat='count', aes(label=..count..),
vjust=1.5, color = 'white', fontface = 'bold') +
labs(title = 'Rating Faceted by Score') +
theme(
plot.title = element_text(size=10, face="italic"))
str(ww)
## 'data.frame': 4898 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ score : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ rating : Ord.factor w/ 5 levels "very bad"<"bad"<..: 3 3 3 3 3 3 3 3 3 3 ...
# Shows the order of an ordered factor variable
unique(ww$score)
## [1] 6 5 7 8 4 3 9
## Levels: 3 < 4 < 5 < 6 < 7 < 8 < 9
# Shows the order of an ordered factor variable
unique(ww$rating)
## [1] average bad good very good very bad
## Levels: very bad < bad < average < good < very good
The structure shows the two new variables at the bottom.
The dataset contains the following variables. An explanation of each is provided to help the reader understand the data and results.
Note: dm^3 stands for Cubic decimetre, a volume unit equivalent to a litre.
Output variable (based on sensory data):
The first thing I want to do is get a quick look at all the data in one place. For that, I am creating histogram and density plot matrices.
# Remove the 'X' column for plotting purposes
no_x <- subset(ww, select = -X)
# Creates the histogram matrix using only numeric variables
no_x %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = 'dodgerblue4') +
labs(title = 'Histogram Matrix') +
theme(
plot.title = element_text(size=12, face="italic"))
# Creates the density matrix using only numeric variables
no_x %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density(fill = 'dodgerblue4') +
labs(title = 'Density Matrix') +
theme(
plot.title = element_text(size=12, face="italic"))
The matrices show a mix of results. Some plots loosely resemble a normal distribution, but many are bimodal or multimodal. Almost all of them have a right skew to a lesser or greater extent. Now I want to look at individual variables.
The alcohol variable represents the percent alcohol content of the wine.
catal1 <- ggplot(ww, aes(x = alcohol)) +
geom_histogram(binwidth = .1, color = 'black', fill = 'steelblue1') +
scale_x_continuous(breaks = seq(8, 14, .5)) +
geom_vline(aes(xintercept = mean(alcohol)), col='red', size = 1) +
labs(title = 'Alcohol Histogram w/Mean') +
theme(
plot.title = element_text(size=12, face="italic"))
catal2 <- ggplot(ww, aes(x = alcohol)) +
geom_histogram(aes(y=..density..),
binwidth = .1, color = 'black', fill = 'steelblue1') +
geom_density(alpha=.2, fill="maroon1") +
scale_x_continuous(breaks = seq(8, 14, .5)) +
geom_vline(aes(xintercept = mean(alcohol)), col='red', size = 1) +
labs(title = 'Alcohol Histogram + Density Plot Overlay w/Mean') +
theme(
plot.title = element_text(size=12, face="italic"))
catal3 <- ggplot(ww, aes(x = "", y = ww$alcohol)) +
geom_boxplot(fill = 'steelblue1') +
scale_y_continuous(breaks = seq(8, 14, .5)) +
coord_flip() +
xlab('alcohol') +
ylab('count') +
geom_hline(aes(yintercept = mean(alcohol)), col='red', size = 1) +
labs(title = 'Alcohol Boxplot w/Mean') +
theme(
plot.title = element_text(size=12, face="italic"))
grid.arrange(catal1, catal2, catal3, ncol = 1)
summary(ww$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol Charts Summary: The alcohol attribute histogram shows there is quite a lot of variation in the heights of the bars, which made it a little difficult to understand the distribution. The density plot overlay shows a right-skewed, multimodal distribution. The boxplot shows the interquartile range, and there are no outliers.
# Gets the first quantile
alq1 <- quantile(ww$alcohol, 0.25)
alq1
## 25%
## 9.5
# Gets the third quantile
alq3 <- quantile(ww$alcohol, 0.75)
alq3
## 75%
## 11.4
# Calculates the IQR
alqr <- IQR(ww$alcohol)
alqr
## [1] 1.9
# Calculating lower inner outlier fence
al_low_inn <- 9.5 - (alqr * 1.5)
al_low_inn
## [1] 6.65
# Calculating upper inner outlier fence
al_upp_inn <- 11.4 + (alqr * 1.5)
al_upp_inn
## [1] 14.25
# Calculating lower outer oulier fence
al_low_out <- 9.5 - (alqr * 3)
al_low_out
## [1] 3.8
# Calculating upper outer oulier fence
al_upp_out <- 11.4 + (alqr * 3)
al_upp_out
## [1] 17.1
# Calculates the number of outliers below the inner lower fence (mild outliers)
sum(ww$alcohol < al_low_inn)
## [1] 0
# Calculates the number of outliers above the inner upper fence (mild outliers)
sum(ww$alcohol > al_upp_inn)
## [1] 0
# Calculates the number of outliers above the outer upper fence (extreme outliers)
sum(ww$alcohol > al_upp_out)
## [1] 0
# Calculates the value for the 99th percentile
alninetynine <- quantile(ww$alcohol, 0.99)
# Calculates the number of values greater than the 99th percentile
sum(ww$alcohol > alninetynine)
## [1] 41
Total sulfur dioxide is the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
cattsd1 <- ggplot(ww, aes(x = total.sulfur.dioxide)) +
geom_histogram(binwidth = 5, color = 'black', fill = 'steelblue1') +
scale_x_continuous(breaks = seq(0, 440, 25)) +
geom_vline(aes(xintercept = mean(total.sulfur.dioxide)), col='red', size = 1)+
labs(title = 'total.sulfur.dioxide Histogram w/Mean') +
theme(plot.title = element_text(size=12, face="italic"))
cattsd2 <- ggplot(ww, aes(x = total.sulfur.dioxide)) +
geom_histogram(aes(y=..density..),
binwidth = 5, color = 'black', fill = 'steelblue1') +
geom_density(alpha=.2, fill="maroon1") +
scale_x_continuous(breaks = seq(0, 440, 25)) +
geom_vline(aes(xintercept = mean(total.sulfur.dioxide)), col='red', size = 1)+
labs(title = 'total.sulfur.dioxide Histogram + Density Plot Overlay w/Mean') +
theme(plot.title = element_text(size=12, face="italic"))
cattsd3 <- ggplot(ww, aes(x = "", y = ww$total.sulfur.dioxide)) +
geom_boxplot(fill = 'steelblue1') +
scale_y_continuous(breaks = seq(0, 440, 25)) +
coord_flip() +
xlab('total.sulfur.dioxide') +
ylab('count') +
geom_hline(aes(yintercept = mean(total.sulfur.dioxide)), col='red', size = 1)+
labs(title = 'total.sulfur.dioxide Boxplot w/Mean') +
theme(plot.title = element_text(size=12, face="italic"))
grid.arrange(cattsd1, cattsd2, cattsd3, ncol = 1)
summary(ww$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total Sulfur Dioxide Charts Summary: The total sulfur dioxide attribute has a fairly normal distribution with a right skew due to outliers, though the density plot overlay seems mildly multimodal.
These are calculated in order to be used later with bivariate and multivariate plots.
# Gets the first quantile
sdq1 <- quantile(ww$total.sulfur.dioxide, 0.25)
sdq1
## 25%
## 108
# Gets the third quantile
sdq3 <- quantile(ww$total.sulfur.dioxide, 0.75)
sdq3
## 75%
## 167
# Calculates the IQR
sdqr <- IQR(ww$total.sulfur.dioxide)
sdqr
## [1] 59
# Calculating lower inner outlier fence
sd_low_inn <- 108 - (sdqr * 1.5)
sd_low_inn
## [1] 19.5
# Calculating upper inner outlier fence
sd_upp_inn <- 167 + (sdqr * 1.5)
sd_upp_inn
## [1] 255.5
# Calculating lower outer oulier fence
sd_low_out <- 108 - (sdqr * 3)
sd_low_out
## [1] -69
# Calculating upper outer oulier fence
sd_upp_out <- 167 + (sdqr * 3)
sd_upp_out
## [1] 344
# Calculates the number of outliers below the inner lower fence (mild outliers)
sum(ww$total.sulfur.dioxide < sd_low_inn)
## [1] 5
# Calculates the number of outliers above the inner upper fence (mild outliers)
sum(ww$total.sulfur.dioxide > sd_upp_inn)
## [1] 14
# Calculates the number of outliers above the outer upper fence (extreme outliers)
sum(ww$total.sulfur.dioxide > sd_upp_out)
## [1] 2
# Calculates the value for the 99th percentile
sdninetynine <- quantile(ww$total.sulfur.dioxide, 0.99)
# Calculates the number of values greater than the 99th percentile
sum(ww$total.sulfur.dioxide > sdninetynine)
## [1] 49
Results:
The reason I calculate the 99th percentile and find values above it is to get an idea of how that compares to the number of outliers there are. In this case, if I were to limit data to the 99th percentile (which is something we were taught), I would be removing data from the 4th quartile. Compared to the number of values in the dataset (4898) it’s not very much, but it’s I think it’s good to know.
Residual sugar shows the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
catrs1 <- ggplot(ww, aes(x = residual.sugar)) +
geom_histogram(binwidth = .5, color = 'black', fill = 'steelblue1') +
scale_x_continuous(breaks = seq(0, 65, 5)) +
geom_vline(aes(xintercept = mean(residual.sugar)), col='red', size = 1) +
labs(title = 'residual.sugar Histogram w/Mean',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catrs2 <- ggplot(ww, aes(x = residual.sugar)) +
geom_histogram(aes(y=..density..),
binwidth = .2,
color = 'black',
fill = 'steelblue1',
alpha=.7) +
geom_density(alpha=.2, fill="maroon1") +
scale_x_continuous(breaks = seq(0, 32, 2),
lim = c(0, quantile(ww$residual.sugar, .99))) +
geom_vline(aes(xintercept = mean(residual.sugar)), col='red', size = 1) +
labs(
title = 'residual.sugar Histogram + Density Overlay w/Mean',
subtitle = 'Top 1% Excluded') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catrs3 <- ggplot(ww, aes(x = "", y = ww$residual.sugar)) +
geom_boxplot(fill = 'steelblue1') +
scale_y_continuous(breaks = seq(0, 65, 5)) +
coord_flip() +
xlab('residual.sugar') +
ylab('count') +
geom_hline(aes(yintercept = mean(residual.sugar)), col='red', size = 1) +
labs(
title = 'residual.sugar Boxplot w/Mean',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catrs4 <- ggplot(ww, aes(x = residual.sugar)) +
geom_histogram(aes(y=..density..),
binwidth = .02,
color = 'black',
fill = 'steelblue1',
alpha=.7) +
geom_density(alpha=.2, fill="maroon1") +
scale_x_log10(breaks = seq(0, 40, 2)) +
geom_vline(aes(xintercept = mean(residual.sugar)), col='red', size = 1) +
labs(
title = 'residual.sugar Histogram + Density Overlay w/Mean',
subtitle = 'Scaled to log10') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
grid.arrange(catrs1, catrs2, catrs3, catrs4, ncol = 1)
summary(ww$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual Sugar Summary: The residual sugar attribute has such an extreme right skew along with some large outliers that I plotted it in two additional ways. First, I excluded the top 1% of the data by using a 99% quantile as a limit for the second plot. Second, I took the log10 scale of the full data, which is the bottom plot. There is a clear bimodal, possibly even multimodal, distribution that can be seen somewhat in the excluded plot, but is very pronounced in the scaled plot. From 0 to 2, there are some extreme spikes between bins. It is interesting to note that by their description of what is considered a sweet wine (> 45 grams/liter), only one wine in the dataset qualifies, and it is the highest outlier for this variable.
These are calculated in order to be used later with bivariate and multivariate plots.
# Gets the first quantile
rsq1 <- quantile(ww$residual.sugar, 0.25)
rsq1
## 25%
## 1.7
# Gets the third quantile
rsq3 <- quantile(ww$residual.sugar, 0.75)
rsq3
## 75%
## 9.9
# Calculates the IQR
rsqr <- IQR(ww$residual.sugar)
rsqr
## [1] 8.2
# Calculating lower and upper inner outlier fences
rs_low_inn <- 1.7 - (rsqr * 1.5)
rs_upp_inn <- 9.9 + (rsqr * 1.5)
rs_upp_inn
## [1] 22.2
# Calculating lower and upper outer oulier fences
rs_low_out <- 1.7 - (rsqr * 3)
rs_upp_out <- 9.9 + (rsqr * 3)
rs_upp_out
## [1] 34.5
# Calculates the number of outliers above the inner upper fence (mild outliers)
sum(ww$residual.sugar > rs_upp_inn)
## [1] 7
# Calculates the number of outliers above the outer upper fence (extreme outliers)
sum(ww$residual.sugar > rs_upp_out)
## [1] 1
# Calculates the value for the 99th percentile
ninetynine <- quantile(ww$residual.sugar, 0.99)
# Calculates the number of values greater than the 99th percentile
sum(ww$residual.sugar > ninetynine)
## [1] 47
Results:
Chlorides is the amount of salt in the wine.
catch1 <- ggplot(ww, aes(x = chlorides)) +
geom_histogram(binwidth = .0025, color = 'black', fill = 'steelblue1') +
geom_vline(aes(xintercept = mean(chlorides)), col='red', size = 1) +
labs(title = 'chlorides Histogram w/Mean') +
theme(
plot.title = element_text(size=12, face="italic"))
catch2 <- ggplot(ww, aes(x = chlorides)) +
geom_histogram(binwidth = .0025, color = 'black', fill = 'steelblue1') +
scale_x_continuous(breaks = seq(0, .2, .025),
lim = c(0, quantile(ww$chlorides, .99))) +
geom_vline(aes(xintercept = mean(chlorides)), col='red', size = 1) +
labs(title = 'chlorides Histogram w/Mean',
subtitle = 'Top 1% Excluded') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catch3 <- ggplot(ww, aes(x = chlorides)) +
geom_histogram(aes(y=..density..),
binwidth = .0025,
color = 'black',
fill = 'steelblue1',
alpha=.7) +
geom_density(alpha=.2, fill="maroon1") +
scale_x_continuous(breaks = seq(0, .2, .025),
lim = c(0, quantile(ww$chlorides, .99))) +
geom_vline(aes(xintercept = mean(chlorides)), col='red', size = 1) +
labs(
title = 'chlorides Histogram + Density Plot Overlay w/Mean',
subtitle = 'Top 1% Excluded') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catch4 <- ggplot(ww, aes(x = "", y = ww$chlorides)) +
geom_boxplot(fill = 'steelblue1') +
scale_y_continuous(breaks = seq(0, .2, .02),
lim = c(0, quantile(ww$chlorides, .99))) +
coord_flip() +
xlab('chlorides') +
ylab('count') +
geom_hline(aes(yintercept = mean(chlorides)), col='red', size = 1) +
labs(title = 'chlorides Boxplot w/Mean',
subtitle = 'Top 1% Excluded') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
grid.arrange(catch1, catch2, catch3, catch4, ncol = 1)
summary(ww$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides Summary: The chlorides attribute has extremely small values. It resembles a normal distribution, and the density overlay shows this to be a bimodal distribution. It has a long right tail full of outliers. Even when removing the top 1% of the data there are still a lot of outliers in these plots. If I excluded the top 5%, it would take it down to around 0.066, which would then be eliminating data from the 4th quartile.
These are calculated in order to be used later with bivariate and multivariate plots.
# Gets the first quantile
clq1 <- quantile(ww$chlorides, 0.25)
clq1
## 25%
## 0.036
# Gets the third quantile
clq3 <- quantile(ww$chlorides, 0.75)
clq3
## 75%
## 0.05
# Calculates the IQR
clqr <- IQR(ww$chlorides)
clqr
## [1] 0.014
# Calculating lower inner outlier fence
cl_low_inn <- 0.036 - (clqr * 1.5)
cl_low_inn
## [1] 0.015
# Calculating upper inner outlier fence
cl_upp_inn <- 0.05 + (clqr * 1.5)
cl_upp_inn
## [1] 0.071
# Calculating lower outer oulier fence
cl_low_out <- 0.036 - (clqr * 3)
cl_low_out
## [1] -0.006
# Calculating upper outer oulier fence
cl_upp_out <- 0.05 + (clqr * 3)
cl_upp_out
## [1] 0.092
# Calculates the number of outliers below the inner lower fence (mild outliers)
sum(ww$chlorides < cl_low_inn)
## [1] 7
# Calculates the number of outliers above the inner upper fence (mild outliers)
sum(ww$chlorides > cl_upp_inn)
## [1] 201
# Calculates the number of outliers above the outer upper fence (extreme outliers)
sum(ww$chlorides > cl_upp_out)
## [1] 131
# Calculates the value for the 95th percentile
clninetyfive <- quantile(ww$chlorides, 0.95)
clninetyfive
## 95%
## 0.067
# Calculates the value for the 99th percentile
clninetynine <- quantile(ww$chlorides, 0.99)
# Calculates the number of values greater than the 95th percentile
sum(ww$chlorides > clninetyfive)
## [1] 237
# Calculates the number of values greater than the 99th percentile
sum(ww$chlorides > clninetynine)
## [1] 48
Results:
Volatile acidity is the amount of acetic acid in wine, which can lead to an unpleasant vinegar taste at too high of levels.
catva1 <- ggplot(ww, aes(x = volatile.acidity)) +
geom_histogram(binwidth = .01, color = 'black', fill = 'steelblue1') +
scale_x_continuous(breaks = seq(0, 1, .1)) +
geom_vline(aes(xintercept = mean(volatile.acidity)), col='red', size = 1) +
labs(title = 'volatile.acidity Histogram w/Mean') +
theme(
plot.title = element_text(size=12, face="italic"))
catva2 <- ggplot(ww, aes(x = volatile.acidity)) +
geom_histogram(aes(y=..density..),
binwidth = .01,
color = 'black',
fill = 'steelblue1',
alpha=.7) +
geom_density(alpha=.2, fill="maroon1") +
scale_x_continuous(breaks = seq(0, 1, .1)) +
geom_vline(aes(xintercept = mean(volatile.acidity)), col='red', size = 1) +
labs(title = 'volatile.acidity Histogram + Density Plot Overlay w/Mean') +
theme(
plot.title = element_text(size=12, face="italic"))
catva3 <- ggplot(ww, aes(x = "", y = ww$volatile.acidity)) +
geom_boxplot(fill = 'steelblue1') +
scale_y_continuous(breaks = seq(0, 1, .1)) +
coord_flip() +
xlab('volatile.acidity') +
ylab('count') +
geom_hline(aes(yintercept = mean(volatile.acidity)), col='red', size = 1) +
labs(title = 'volatile.acidity Boxplot w/Mean') +
theme(
plot.title = element_text(size=12, face="italic"))
grid.arrange(catva1, catva2, catva3, ncol = 1)
summary(ww$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile Acidity Summary: The volatile acidity attribute has a fairly normal distribution but with a long right tail mostly made up of outliers. This can be seen in the summary, where the 3rd quartile is 0.32, and the max is 1.1.
These are calculated in order to be used later with bivariate and multivariate plots.
# Gets the first quantile
vaq1 <- quantile(ww$volatile.acidity, 0.25)
vaq1
## 25%
## 0.21
# Gets the third quantile
vaq3 <- quantile(ww$volatile.acidity, 0.75)
vaq3
## 75%
## 0.32
# Calculates the IQR
vaqr <- IQR(ww$volatile.acidity)
vaqr
## [1] 0.11
# Calculating lower inner outlier fence
va_low_inn <- 0.21 - (vaqr * 1.5)
va_low_inn
## [1] 0.045
# Calculating upper inner outlier fence
va_upp_inn <- 0.32 + (vaqr * 1.5)
va_upp_inn
## [1] 0.485
# Calculating lower outer oulier fence
va_low_out <- 0.21 - (vaqr * 3)
va_low_out
## [1] -0.12
# Calculating upper outer oulier fence
va_upp_out <- 0.32 + (vaqr * 3)
va_upp_out
## [1] 0.65
# Calculates the number of outliers below the inner lower fence (mild outliers)
sum(ww$volatile.acidity < va_low_inn)
## [1] 0
# Calculates the number of outliers above the inner upper fence (mild outliers)
sum(ww$volatile.acidity > va_upp_inn)
## [1] 186
# Calculates the number of outliers above the outer upper fence (extreme outliers)
sum(ww$volatile.acidity > va_upp_out)
## [1] 39
# Calculates the value for the 99th percentile
vaninetynine <- quantile(ww$volatile.acidity, 0.99)
# Calculates the number of values greater than the 99th percentile
sum(ww$volatile.acidity > vaninetynine)
## [1] 48
Results:
The density of water is close to that of water depending on the percent alcohol and sugar content.
catd1 <- ggplot(ww, aes(x = density)) +
geom_histogram(binwidth = .0005, color = 'black', fill = 'steelblue1') +
scale_x_continuous(breaks = seq(0, 1.04, .005)) +
geom_vline(aes(xintercept = mean(density)), col='red', size = 1) +
labs(title = 'density Histogram w/Mean',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catd2 <- ggplot(ww, aes(x = density)) +
geom_histogram(aes(y=..density..),
binwidth = .0005,
color = 'black',
fill = 'steelblue1',
alpha=.7) +
geom_density(alpha=.2, fill="maroon1") +
scale_x_continuous(breaks = seq(0, 1.04, .005)) +
geom_vline(aes(xintercept = mean(density)), col='red', size = 1) +
labs(title = 'density Histogram + Density Plot Overlay w/Mean',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catd3 <- ggplot(ww, aes(x = "", y = ww$density)) +
geom_boxplot(fill = 'steelblue1', outlier.alpha = 0.3) +
scale_y_continuous(breaks = seq(0, 1.04, .005)) +
coord_flip() +
xlab('density') +
ylab('count') +
geom_hline(aes(yintercept = mean(density)), col='red', size = 1) +
labs(title = 'density Boxplot w/Mean',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
catd4 <- ggplot(ww, aes(x = density)) +
geom_histogram(aes(y=..density..),
binwidth = .0002,
color = 'black',
fill = 'steelblue1',
alpha=.7) +
geom_density(alpha=.2, fill="maroon1") +
scale_x_continuous(lim = c(.985, 1.005)) +
geom_vline(aes(xintercept = mean(density)), col='red', size = 1) +
labs(title = 'density Histogram + Density Plot Overlay w/Mean',
subtitle = 'Data Restricted') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
grid.arrange(catd1, catd2, catd3, catd4, ncol = 1)
summary(ww$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density Summary: The density attribute has a multimodal distribution. There are approximately 6 outliers that create an extreme right tail, but applying a log10 scale does nothing to help visualize the distribution. Instead, I simply set breaks to limit the plot to the bulk of the data, which can be seen at the bottom. Attempting to limit by a 99% quartile excluded usable data from the plot.
These are calculated in order to be used later with bivariate and multivariate plots.
# Gets the first quantile
dq1 <- quantile(ww$density, 0.25)
vaq1
## 25%
## 0.21
# Gets the third quantile
dq3 <- quantile(ww$density, 0.75)
dq3
## 75%
## 0.9961
# Calculates the IQR
dqr <- IQR(ww$density)
dqr
## [1] 0.0043775
# Calculating lower inner outlier fence
d_low_inn <- 0.21 - (dqr * 1.5)
d_low_inn
## [1] 0.2034338
# Calculating upper inner outlier fence
d_upp_inn <- 0.9961 + (dqr * 1.5)
d_upp_inn
## [1] 1.002666
# Calculating lower outer oulier fence
d_low_out <- 0.21 - (dqr * 3)
d_low_out
## [1] 0.1968675
# Calculating upper outer oulier fence
d_upp_out <- 0.9961 + (dqr * 3)
d_upp_out
## [1] 1.009232
# Calculates the number of outliers below the inner lower fence (mild outliers)
sum(ww$density < d_low_inn)
## [1] 0
# Calculates the number of outliers above the inner upper fence (mild outliers)
sum(ww$density > d_upp_inn)
## [1] 5
# Calculates the number of outliers above the outer upper fence (extreme outliers)
sum(ww$density > d_upp_out)
## [1] 3
# Calculates the value for the 99th percentile
dninetynine <- quantile(ww$density, 0.99)
# Calculates the number of values greater than the 99th percentile
sum(ww$density > dninetynine)
## [1] 49
Results:
The main feature is quality. The goal of this analysis is to determine which attributes would be useful in predicting a quality score.
Based on the provided description of the attributes, I want to look at alcohol, total sulfur dioxide, sugar, chlorides, volatile acidity, and density. I feel like these may have some influence on taste.
Yes, I created a variable called ‘rating’. See the Additional Variables sub section under the Data Exploration section at the top of this document.
The residual sugar distribution was the most unusual with a heavy right skew. I scaled it to log10 in order to get a clearer understanding of it. This helps to reveal the smaller values that are highly compressed in the original plot.
Before I start plotting any bivariate data, I want to see how the attributes are correlated with each other. For that, I created a heatmap with correlation statistics.
# Keeps only variables with a numeric data type, and using the 'no_x' dataframe
cormat <- no_x %>% keep(is.numeric)
# Rounds all the values to 2 decimals
cormat <- round(cor(cormat), 2)
# Melts the dataframe so there are 2 variables
melted_cormat <- melt(cormat)
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat) {
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
# Put the upper triangle of the correlation into a variable
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "dodgerblue3",
high = "#FF6363",
mid = "gray95",
midpoint = 0, limit = c(-1,1),
space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, size = 11, hjust = 1),
axis.text.y = element_text(size = 11)) +
coord_fixed()
# Theme the heatmap for appearances
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.5, 0.75),
legend.direction = "horizontal") +
guides(fill = guide_colorbar(barwidth = 9, barheight = 1,
title.position = "top",
title.hjust = 0.5))
Observations: From this plot, I can see that alcohol has the highest correlation to quality, though at 0.44 I would consider it only a moderate correlation. The second highest is pH, but at 0.1 I would consider that weak. Both residual sugar and chlorides are even weaker, both with negative values for correlation to quality.
Using boxplots I am going to explore the relationship between quality and the other variables I believe might have an effect on the quality score.
bpm1 <- ggplot(ww, aes(x = score, y = alcohol)) +
geom_boxplot(fill = 'steelblue1') +
labs(title = 'Alcohol % by volume') +
xlab('score') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpm2 <- ggplot(ww, aes(x = score, y = total.sulfur.dioxide)) +
geom_boxplot(fill = 'steelblue1') +
labs(title = 'total.sulfur.dioxide mg / dm^3') +
xlab('score') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpm3 <- ggplot(ww, aes(x = score, y = residual.sugar)) +
geom_boxplot(fill = 'steelblue1') +
labs(title = 'Residual Sugar g / dm^3') +
xlab('score') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpm4 <- ggplot(ww, aes(x = score, y = chlorides)) +
geom_boxplot(fill = 'steelblue1') +
labs(title = 'Chlorides (salt) g / dm^3') +
xlab('score') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpm5 <- ggplot(ww, aes(x = score, y = volatile.acidity)) +
geom_boxplot(fill = 'steelblue1') +
labs(title = 'Volatile Acidity (acetic acid) g / dm^3') +
xlab('score') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpm6 <- ggplot(ww, aes(x = score, y = density)) +
geom_boxplot(fill = 'steelblue1') +
labs(title = 'Density g / cm^3') +
xlab('score') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
grid.arrange(bpm1, bpm2, bpm3, bpm4, bpm5, bpm6, ncol = 2,
top=textGrob("Variable Boxplots - All Data",
gp=gpar(fontsize=15, font=2)))
The outliers, particularly in in the bottom four plots make it difficult to determine what effect, if any, they have on quality. To get a better understanding, I’ll remove some outliers. The outlier fences I calculated for each variable in the univariate section will be useful for this.
bpno1 <- ggplot(ww, aes(x = score, y = alcohol)) +
geom_boxplot(fill = 'steelblue1') +
geom_smooth(method = "lm", se=FALSE, color="red", aes(group=1)) +
scale_y_continuous(lim = c(8, al_upp_inn)) +
xlab('score') +
labs(title = 'Alcohol % by volume') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpno2 <- ggplot(ww, aes(x = score, y = total.sulfur.dioxide)) +
geom_boxplot(fill = 'steelblue1') +
geom_smooth(method = "lm", se=FALSE, color="red", aes(group=1)) +
scale_y_continuous(lim = c(sd_low_inn, sd_upp_inn)) +
xlab('score') +
labs(title = 'Total Sulfur Dioxide mg / dm^3') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpno3 <- ggplot(ww, aes(x = score, y = residual.sugar)) +
geom_boxplot(fill = 'steelblue1') +
geom_smooth(method = "lm", se=FALSE, color="red", aes(group=1)) +
scale_y_continuous(lim = c(0, rs_upp_inn)) +
xlab('score') +
labs(title = 'Residual Sugar g / dm^3') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpno4 <- ggplot(ww, aes(x = score, y = chlorides)) +
geom_boxplot(fill = 'steelblue1') +
geom_smooth(method = "lm", se=FALSE, color="red", aes(group=1)) +
scale_y_continuous(lim = c(cl_low_inn, cl_upp_inn)) +
xlab('score') +
labs(title = 'Chlorides (salt) g / dm^3') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpno5 <- ggplot(ww, aes(x = score, y = volatile.acidity)) +
geom_boxplot(fill = 'steelblue1') +
geom_smooth(method = "lm", se=FALSE, color="red", aes(group=1)) +
scale_y_continuous(lim = c(quantile(ww$volatile.acidity, 0), va_upp_inn)) +
xlab('score') +
labs(title = 'Volatile Acidity (acetic acid) g / dm^3') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
bpno6 <- ggplot(ww, aes(x = score, y = density)) +
geom_boxplot(fill = 'steelblue1') +
geom_smooth(method = "lm", se=FALSE, color="red", aes(group=1)) +
scale_y_continuous(lim = c(quantile(ww$density, 0), d_upp_inn)) +
xlab('score') +
labs(title = 'Density g / cm^3') +
theme_bw() +
theme(
plot.title = element_text(size=10, face="italic"))
grid.arrange(bpno1, bpno2, bpno3, bpno4, bpno5, bpno6, ncol = 2,
top=textGrob("Variable Boxplots - Outliers Reduced",
gp=gpar(fontsize=15, font=2)))
With the outliers reduced, it is easier to interpret the data. I added a regression line to provide a better sense of movement between quality scores.
Alcohol: Initially the values go down, but starting at an average rating, as alcohol by volume goes up, so does the rating and quality scores, indicating that alcohol may play a role in better tasting wines.
Total Sulfur Dioxide: The interquartile ranges get smaller as quality goes up. The regression line is mildly negative.
Residual sugar: This is harder to interpret. The median values and interquartile ranges vary wildly, but the regression line shows a downward trend. I want to look at this further.
Chlorides (salt): Interestingly, as quality goes up, salt content seems to go down, indicating a negative relationship between salt and better tasting wines. This confirms the correlation to quality of -0.21. However, that is a weak correlation, and the chart may be over-emphasizing this negative relationship.
Volatile acidity: The values start going up initially, but then they go down, followed by a slight recovery. There is little relationship with quality, as is confirmed by the correlation of -0.19.
Density: The overall trend seems that density goes down as quality goes up, confirmed by the correlation of -0.31.
Because the quality variable is discrete, a scatterplot would only show vertical lines. Instead, I am going to try a dispersion plot to get a better sense of the data. I also added boxplot frames over the top to provide a point of reference, otherwise you would just see a column of dots.
ggplot(ww, aes(x = score, y = alcohol)) +
geom_point(aes(color = score), position = 'jitter', size = .8) +
geom_boxplot(alpha = .1, lwd = .3) +
scale_y_continuous(lim = c(8, al_upp_inn)) +
labs(title = 'Alcohol % by volume',
subtitle = 'Alcohol Outliers Removed') +
scale_color_manual(
values=c("red4", "red", "orange", "springgreen3",
"dodgerblue", "blue", "dodgerblue4"),
guide = guide_legend(
title = 'score',
override.aes = list(alpha = 1, size = 5))) +
theme(plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
This plot isn’t as useful with this dataset. There are so many values in scores of 5, 6, and 7 that even with the boxplot overlay, it’s difficult to glean any meaning from this, other than a vague idea of sheer volume. This type of plot works better with fewer values, like scores 3, 4, 8, and 9. Therefore, I did not create further dispersion plots.
Below I’m looking into relationships between the attributes of interest against each other.
ggplot(ww, aes(x = alcohol, y = density)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
stat_smooth(method = 'auto', color = 'black') +
stat_smooth(method = 'lm', color = 'red') +
labs(title = 'Alcohol vs Density',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
ggplot(ww, aes(x = alcohol, y = density)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
stat_smooth(method = 'auto', color = 'black') +
stat_smooth(method = 'lm', color = 'red') +
scale_y_continuous(lim = c(min(ww$density), d_upp_inn)) +
labs(title = 'Alcohol vs Density',
subtitle = 'Density Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
I already know that there is a strong negative correlation between density and alcohol at -0.78, and from the description provided on density this is logical. I included two conditional means lines, red for linear, black for auto, which is GAM (Generalized Additive Model). They are pretty close together, meaning the data is fairly well centered.
## ggMarginal doesn't seem to play nice with RStudio; no plot is shown, but it is in the html file.
g2 <- ggplot(ww, aes(x = alcohol, y = residual.sugar)) +
geom_point(color = 'dodgerblue1', alpha = .5) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
labs(title = 'Alcohol vs Residual Sugar',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
ggMarginal(g2, type = "density", size = 9, fill="steelblue1")
## ggMarginal doesn't seem to play nice with RStudio; no plot is shown, but it is in the html file.
g3 <- ggplot(ww, aes(x = alcohol, y = residual.sugar)) +
geom_point(color = 'dodgerblue1', alpha = .5) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
scale_y_continuous(breaks = seq(0, 25, 5),
lim = c(0, rs_upp_inn)) +
labs(title = 'Alcohol vs Residual Sugar',
subtitle = 'Residual Sugar Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
ggMarginal(g3, type = "density", size = 9, fill="steelblue1")
These scatter plots are more challenging to decipher, especially the second one where I removed the outliers, so I added density plots in the margins to help get a better understanding.
Sugars clearly go down as alcohol goes up, which is logical since sugars are consumed in the fermentation process. From the heatmap, I know that residual sugar has a correlation of -0.45, which is a moderate (negative) relationship with alcohol. The two conditional means lines do not line up at all, so the center of the data is not linear. I added density plots in the margin to help visualize the skew of the data; recall that alcohol had a skew, but not nearly as severe as residual sugar did, and both were multimodal.
ggplot(ww, aes(x = alcohol, y = total.sulfur.dioxide)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
labs(title = 'Alcohol vs Total Sulfur Dioxide') +
theme(
plot.title = element_text(size=12, face="italic"))
ggplot(ww, aes(x = alcohol, y = total.sulfur.dioxide)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
scale_y_continuous(breaks = seq(0, 250, 50),
lim = c(sd_low_inn, sd_upp_inn)) +
labs(title = 'Alcohol vs Total Sulfur Dioxide',
subtitle = 'Sulfur Dioxide Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
Total sulfur dioxide has the same correlation to alcohol as residual sugar does at -0.45 but what’s interesting about that is the distibution of the points are very different from sugar. They are much more compact at the lower alcohol levels, and the conditional means lines demonstrate this difference.
ggplot(ww, aes(x = alcohol, y = chlorides)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
labs(title = 'Alcohol vs Chlorides',
subtitle = 'All Data') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
ggplot(ww, aes(x = alcohol, y = chlorides)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
scale_y_continuous(lim = c(cl_low_inn, cl_upp_inn)) +
labs(title = 'Alcohol vs Chlorides',
subtitle = 'Chloride Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
There is a similar negative correlation of chlorides to alcohol at -0.36. I do not know if salt is consumed in the fermenation process of wine, but I am aware of other things that are fermented using salt, so it is reasonble to assume so here. The conditional means lines are very close together, indicating the data is reasonably well centered.
We were asked to explore the two variables with the hightest correlation.
ggplot(ww, aes(x = residual.sugar, y = density)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
labs(title = 'Residual Sugar vs Density') +
theme(
plot.title = element_text(size=12, face="italic"))
ggplot(ww, aes(x = residual.sugar, y = density)) +
geom_point(color = 'dodgerblue1', alpha = .3, show.legend=F) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
scale_x_continuous(limits = c(min(ww$residual.sugar), rs_upp_inn)) +
scale_y_continuous(limits = c(min(ww$density), d_upp_inn)) +
labs(title = 'Residual Sugar vs Density',
subtitle = 'All Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face='italic'))
The highest correlation of any two variables in the dataset was between residual sugar and density at 0.84. This is clearly visible in the plots. The description provided on density state that sugar is one of the things that can affect it. The conditional means lines are very close together, so the data is well centered.
ggplot(ww, aes(x = residual.sugar, y = total.sulfur.dioxide)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
labs(title = 'Residual Sugar vs Total Sulfur Dioxide') +
theme(
plot.title = element_text(size=12, face="italic"))
tsd <- ggplot(ww, aes(x = residual.sugar, y = total.sulfur.dioxide)) +
geom_point(color = 'dodgerblue1', alpha = .3, show.legend=F) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
scale_x_continuous(lim = c(0, rs_upp_inn)) +
scale_y_continuous(lim = c(sd_low_inn, sd_upp_inn)) +
labs(title = 'Residual Sugar vs Total Sulfur Dioxide',
subtitle = 'All Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
ggMarginal(tsd, type = "density", size = 8, fill="steelblue1")
Total sulfur dioxide has a moderate correlation to residual sugar at 0.4. The conditional means lines don’t quite line up; recall that the log10 of residual sugar showed a multimodal distribution and those fluctuations in the data move the auto line in a wavy pattern. I included density plots in the margins to show the distribution of the data in each attribute, but a clearer picture can be seen in the residual sugar univariate plots near the top of this document.
ggplot(ww, aes(x = total.sulfur.dioxide, y = density)) +
geom_point(color = 'dodgerblue1', alpha = .3) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
labs(title = 'Total Sulfur Dioxide vs Density') +
theme(
plot.title = element_text(size=12, face="italic"))
ggplot(ww, aes(x = total.sulfur.dioxide, y = density)) +
geom_point(color = 'dodgerblue1', alpha = .3, show.legend=F) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'red') +
scale_x_continuous(limits = c(sd_low_inn, sd_upp_inn)) +
scale_y_continuous(limits = c(min(ww$density), d_upp_inn)) +
labs(title = 'Total Sulfur Dioxide vs Density',
subtitle = 'All Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
Total sulfur dioxide and density have a moderate correlation of 0.53. The conditional means lines are pretty close together except at the very beginning, and the sudden downward curve of the auto line in the first plot is due to outliers in the 250-450 range of the x-axis. It is unclear if these two attributes have this correlation due to an actual interaction between them, or if it is coincidental, and there are other attributes that affect each individually. For example, I know density is affected by sugar and alcohol content, but I don’t know if total sulfur dioxide is affected by any of them.
ggplot(ww, aes(alcohol)) +
geom_bar(aes(fill=rating),
binwidth = .1,
col="black",
size=.1) +
scale_fill_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple")) +
labs(title="Alcohol Faceted by Rating",
subtitle="Full Data") +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
top <- subset(ww, quality > 6)
average <- subset(ww, quality == 6)
ggplot(top, aes(alcohol)) +
geom_bar(aes(fill=rating),
binwidth = .1,
col="black",
size=.1) +
scale_fill_manual(
values=c("dodgerblue", "purple", "red")) +
labs(title="Alcohol Faceted by Rating",
subtitle="Limited to Scores Above Average") +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
These plots show where ratings fall within each percent of alcohol. Lower alcohol amounts have higher bad and very bad ratings, while higher alcohol amounts have higher good and very good ratings.
I subsetted the data to include only ratings above average because those counts are lower and harder to see in the full data. I created the same plot with that dataset.
ggplot(ww, aes(residual.sugar)) +
geom_bar(aes(fill=rating),
binwidth = .3,
col="black",
size=.1) +
scale_x_continuous(limits = c(0, rs_upp_inn)) +
scale_fill_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple")) +
labs(title="Residual Sugar Faceted by Rating",
subtitle="Residual Sugar Outliers Removed") +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
ggplot(ww, aes(residual.sugar)) +
geom_bar(aes(fill=rating),
binwidth = .03,
col="black",
size=.1) +
scale_x_log10() +
scale_fill_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple")) +
labs(title="Residual Sugar Faceted by Rating",
subtitle=expression(Log[10]~Scale~-~Full~Data)) +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"))
In the first plot, even though outliers are removed, it’s a little difficult to see some of the details because the massive spike on the left side compresses the rest of the data in the plot. The second plot is scaled to log10 and provides better detail. It shows that wines of all ratings are represented similarly throughout the entire plot. Rating is based on quality, and residual sugar had a correlation with quality at -0.1, so it makes sense that all ratings are represented through the entire range.
When looking at relationships with quality, the highest correlation is with alcohol at 0.44, but this is really only a moderate correlation. The rest of the variables have an even less moderate to weak correlation to quality. Density has the strongest negative correlation at -0.31, but this too is weak-moderate. This would seem to indicate that alcohol is the best predictor of wine quality, but with a correlation that is moderate, it may be the best predictor in the dataset, but not a good predictor overall.
Since alcohol has the strongest relationship with quality, I looked at what attributes have the strongest correlation to alcohol. Density has the strongest, albeit negative correlation, followed by residual sugar and total sulfur dioxide, both also negative. But density seems to be a dependent variable since it is affected by both alcohol and sugar. This is a somewhat dichotomous relationship between the three: sugar is consumed in the fermentation process that creates alcohol; as alcohol goes up, density goes down. So the less sugar consumed in/available for fermentation, the less alcohol created, so density goes up.
Residual sugar and density had the highest correlation of the entire dataset at 0.84.
In this section, I will be using some of the plots in the bivariate section and introducing a third variable to explore the data even further. I’m mainly going to focus on alcohol against other variables since that has the highest correlation to the quality score.
ggplot(ww, aes(x = alcohol, y = residual.sugar, color = rating)) +
geom_point(alpha = .5, show.legend=T) +
geom_smooth(method = 'auto', color = 'black') +
scale_y_continuous(lim = c(0, rs_upp_inn)) +
scale_color_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Alcohol vs Residual Sugar, Faceted by Rating - Scatterplot',
subtitle = 'Residual Sugar Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
ggplot(ww, aes(x = alcohol, y = residual.sugar, color = rating)) +
geom_count(alpha = .4, show.legend=T) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'magenta') +
scale_y_continuous(lim = c(0, rs_upp_inn)) +
scale_color_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Alcohol vs Residual Sugar, Faceted by Rating - Count Plot',
subtitle = 'Residual Sugar Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
bad <- subset(ww, quality < 6)
ggplot(bad, aes(x = alcohol, y = residual.sugar, color = rating)) +
geom_count(alpha = .5, show.legend=T) +
scale_x_continuous(breaks = seq(8.5, al_upp_out, 1),
lim = c(8, al_upp_inn)) +
scale_y_continuous(lim = c(0, rs_upp_inn)) +
scale_color_manual(
values=c("red4", "red"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Alcohol vs Residual Sugar, Faceted by Rating - Count Plot',
subtitle = 'Rating Below Average - Residual Sugar Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
ggplot(average, aes(x = alcohol, y = residual.sugar, color = rating)) +
geom_count(alpha = .5, show.legend=T) +
scale_x_continuous(breaks = seq(8.5, al_upp_out, 1),
lim = c(8, al_upp_inn)) +
scale_y_continuous(lim = c(0, rs_upp_inn)) +
scale_color_manual(
values=c("springgreen2"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Alcohol vs Residual Sugar, Faceted by Rating - Count Plot',
subtitle = 'Average Rating - Residual Sugar Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
ggplot(top, aes(x = alcohol, y = residual.sugar, color = rating)) +
geom_count(alpha = .5, show.legend=T) +
scale_x_continuous(breaks = seq(8.5, al_upp_out, 1),
lim = c(8, al_upp_inn)) +
scale_y_continuous(lim = c(0, rs_upp_inn)) +
scale_color_manual(
values=c("dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Alcohol vs Residual Sugar, Faceted by Rating - Count Plot',
subtitle = 'Rating Above Average - Residual Sugar Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
Initially I created a scatterplot, but while I was researching I discovered a new type of plot called a count plot. This combines points that overlap and make larger points depending on the amount of data. I found this to be much more informative in this circumstance because 1) there are so many overlapping data points, and 2) because it’s color coded by rating, it helps distinguish ratings better than just a scatterplot.
Overall, because of the strong right skew of residual sugar, there are a lot of values clustered along the bottom with low sugar but are represented by all ratings. This wasn’t as pronounced in the bivariate plot, but the color and the size of the dots help to accentualte this fact.
From an x-axis standpoint, I can see that there are far more bad and very bad rated wines at lower alcohol amounts, and as alcohol goes up, so do the ratings.
From a y-axis standpoint, there are bad and very bad ratings scattered throughout the entire range; however, there is pretty significant cluster of good and very good rated wines with low alcohol and high sugar. There’s also some large clusters of average in that same area as well, which indicates that wines with low alcohol and high sugar are also popular.
ggplot(ww, aes(x = alcohol, y = density, color = rating)) +
geom_count(alpha = .5, show.legend=T) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'magenta') +
scale_y_continuous(lim = c(min(ww$density), d_upp_inn)) +
scale_color_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Alcohol vs Density, Faceted by Rating - Count Plot',
subtitle = 'Density Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
This plot is as expected; lower quality wines are concentrated on the left with lower alcohol, higher quality wines are concentrated on the right. The downward slope of the data is also expected; I already know that as sugar and density go in the same direction, so as sugar goes down, density goes down. I also know that sugar gets consumed in the fermentation process, so it is logical that as alcohol goes up, density goes down, just as sugar does.
ggplot(ww, aes(x = alcohol, y = chlorides, color = rating)) +
geom_count(alpha = .5, show.legend=T) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'magenta') +
scale_y_continuous(lim = c(cl_low_inn, cl_upp_inn)) +
scale_color_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Alcohol vs Chlorides (Salt) Faceted by Rating - Count Plot',
subtitle = 'Chloride Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
The outcome is what I expected, and it’s a bit smoother than the previous two plots. As salt goes down, alcohol goes up, and I see more high rated wines as a result. I do wonder however, if salt quantities less than a gram per liter is even detectable by the human palate, especially against the strong taste of alcohol. This would likely be of more interest to a scientist studying salt in wine than this analysis on effects on taste.
ggplot(ww, aes(x = residual.sugar, y = density, color = rating)) +
geom_count(alpha = .4, show.legend=T) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'magenta') +
scale_x_continuous(limits = c(min(ww$residual.sugar), rs_upp_inn)) +
scale_y_continuous(limits = c(min(ww$density), d_upp_inn)) +
scale_color_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Residual Sugar vs Density, Faceted by Rating - Count Plot',
subtitle = 'All Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
This plot is a little trickier to interpret when faceted by rating. Recall that there is an inverse relationship between residual sugar/density and alcohol. In this plot, the lower rated wines are clustered more above the conditional means line, and higher rated wines are below the line, though there is still that cluster of higher ratings with higher sugar that falls above the line. This is logical since higher residual sugar means higher density and lower alcohol.
ggplot(ww, aes(x = residual.sugar, y = total.sulfur.dioxide, color = rating)) +
geom_count(alpha = .5, show.legend=T) +
geom_smooth(method = 'auto', color = 'black') +
geom_smooth(method = 'lm', color = 'magenta') +
scale_x_continuous(lim = c(min(ww$residual.sugar), rs_upp_inn)) +
scale_y_continuous(lim = c(sd_low_inn, sd_upp_inn)) +
scale_color_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Residual Sugar vs Total Sulfur Dioxide, Faceted by Rating - Count Plot',
subtitle = 'All Outliers Removed') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
ggplot(ww, aes(x = residual.sugar, y = total.sulfur.dioxide, color = rating)) +
geom_count(alpha = .5, show.legend=T) +
scale_x_continuous(lim = c(min(ww$residual.sugar),
quantile(ww$residual.sugar, .3))) +
scale_y_continuous(lim = c(sd_low_inn, sd_upp_inn)) +
scale_color_manual(
values=c("red4", "red", "springgreen2", "dodgerblue", "purple"),
guide = guide_legend(override.aes = list(alpha = 1, size = 5))) +
labs(title = 'Residual Sugar vs Total Sulfur Dioxide, Faceted by Rating - Count Plot',
subtitle = 'Restrited to the lower 30% of Residual Sugar') +
theme(
plot.title = element_text(size=12, face="italic"),
plot.subtitle = element_text(size=10, face="italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill="gray90"))
While I know there is a moderate correlation between total sulfur dioxide and residual sugar, adding rating doesn’t seem to offer much extra information. An argument could be made that there are more bad and very bad ratings above the conditional means lines, but it’s still quite scattered. The severe right skew of residual sugar creates a mass of overlapping data points on the left side of the plot. In the second plot I attempted to zoom in on this area, restricting it to the lower 30% of the data, and even then you can still see significant overlapping of data points from every rating.
In these plots, I focused on alcohol and how variables related to it because alcohol had the highest correlation to quality, positive or negative. Higher quality wines generally tended to have higher alcohol and lower sugar. There also was typically less salt. This is logical since both are part of the fermenation process. Density is dependent on sugar and alcohol, so it could be argued that density could possibly help predict rating, though this would seem almost “second hand.”
All the plotting and analyzing I did in the univariate and bivariate sections led me to believe that the outcomes in this section were mostly as expected. What I found interesting was how many wines with higher sugar and lower alcohol had above average ratings. This indicates that alcohol alone is not the best at predicting rating. Personal preference clearly has to play a role in how someone rates a wine, and it is logical that there are people who prefer a sweeter wine and are less concerned with alcohol content.
This project was challenging, not because the coding was difficult or the topic was hard to figure out, but because there are simply so many options that can be done in R and ggplot, so many things you can change in your code to get different results. I found myself going down a lot of “rabbit holes” on the Internet, trying find out how to do something to get plots to do what visualized in my mind. I was able to achieve most of what I wanted, though at times I felt the outcome was becoming too obvious and I wanted something more challenging that would reveal some big insight once I got to the multivariate plots.
I think the dataset is too limiting to make any good predictions. There are other things that I know factor into the taste of wine such as type of grape, the region it’s grown, and how long the wine is aged, and a variety of other things. I think having that type of information would make for a more interesting and productive analysis.
Dataset: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at:
Histogram Matrix: https://drsimonj.svbtle.com/quick-plot-of-all-variables
Histogram with Density Plot Overlay: http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/
Correlation Matrix: http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
Ordering Factors: http://www.cookbook-r.com/Manipulating_data/Changing_the_order_of_levels_of_a_factor/
Count Plots: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
Sites I referenced in a variety of ways: